FEMPI: A Lightweight Fault-tolerant MPI for Embedded Cluster Systems
Authors
Abstract
Ever-increasing demands of space missions for data returns from their limited processing and communications resources have made the traditional approach of data gathering, data compression, and data transmission no longer viable. Increasing on-board processing power by providing high-performance computing (HPC) capabilities using commercial-off-the-shelf (COTS) components is a promising approach that significantly increases performance while reducing cost. However, the susceptibility of COTS components to single-event upsets (SEUs) is a concern that demands a fault-tolerant system infrastructure. Among the components of this infrastructure, message-passing middleware based upon the Message Passing Interface (MPI) standard is essential, so as to support and provide a nearly effortless transition for earth and space science applications in MPI from ground-based computational clusters to HPC systems in space. In this paper, we present the design of a fault-tolerant, MPI-compatible middleware for embedded cluster computing known as FEMPI (Fault-tolerant Embedded MPI). We also present preliminary performance results with FEMPI on a COTS-based, embedded cluster system prototype.
Similar resources
SHIELD: A Fault-Tolerant MPI for an Infiniband Cluster
Today’s high performance cluster computing technologies demand extreme robustness against unexpected failures to finish aggressively parallelized work in a given time constraint. Although there has been a steady effort in developing hardware and software tools to increase fault-resilience of cluster environments, a successful solution has yet to be delivered to commercial vendors. This paper pr...
Towards Middleware for Fault-Tolerance in Distributed Real-Time and Embedded Systems
Distributed real-time and embedded (DRE) systems often require support for multiple simultaneous quality of service (QoS) properties, such as real-timeliness and fault tolerance, that operate within resource constrained environments. These resource constraints motivate the need for a lightweight middleware infrastructure, while the need for simultaneous QoS properties requires the middleware to ...
Failure Resilient Heterogeneous Parallel Computing Across Multidomain Clusters
We propose lightweight middleware solutions that facilitate and simplify the execution of failure-resilient MPI programs across multidomain clusters. The system described in this paper leverages H2O, a distributed metacomputing framework, to route MPI message passing across heterogeneous aggregates located in different administrative or network domains. MPI programs instantiate a specially writ...
MPI/RT - An Emerging Standard for High-Performance Real-Time Systems
The last several years saw the emergence of standardization activities for real-time systems, including standardization of operating systems (the series of POSIX standards [1]), of communication for distributed (POSIX.21 [15]) and parallel systems (MPI/RT [6]), and of real-time object management (real-time CORBA [14]). This article describes the ongoing work of real-time message passing interface (MPI/RT)...
Automatic Fault-Tolerant MPI
High performance computing platforms such as Clusters, Grid and Desktop Grids are becoming larger and subject to more frequent failures. MPI is one of the most used message passing libraries in HPC applications. These two trends raise the need for fault-tolerant MPI. The MPICH-V project focuses on designing, implementing and comparing several automatic fault-tolerant protocols for MPI applicati...
Publication date: 2006